Abstract: In this work, we present hardware and software
implementations of flexible polar systematic encoders and
decoders. The proposed implementations operate on polar codes of
any length less than a maximum and of any rate. We describe the
low-complexity, highly parallel, and flexible systematic-encoding
algorithm that we use and prove its correctness. Our hardware
implementation results show that the overhead of adding code
rate and length flexibility is small, and that the impact on
operation latency is minor compared to code-specific versions.
Finally, the flexible software encoder and decoder implementations
are also shown to maintain high throughput and low latency.

Index Terms —polar codes, systematic encoding, multi-code 
encoders, multi-code decoders. 

I. Introduction 

Modern communication systems must cope with varying
channel conditions and differing throughput and transmission
latency constraints. The 802.11-2012 wireless communication
standard, for example, requires more than twelve
error-correction configurations, increasing implementation
complexity [1], [2]. Such a requirement necessitates encoder and
decoder implementations that are flexible in code rate and
length.

Polar codes achieve the symmetric capacity of memoryless
channels with an explicit construction and are decoded
with the low-complexity successive-cancellation decoding
algorithm [3]. In this paper, we show that apart from the
above favorable properties, polar codes are highly amenable to
flexible encoding and decoding. That is, their regular structure
enables encoder and decoder implementations that support any
polar code of any rate and length, under the constraint of a
maximal codeword length.

Systematic polar coding was described in [4] as a method to
ease information extraction and improve bit-error rate without
affecting the frame-error rate. The systematic encoding scheme
originally proposed in [4] is serial by nature, and seems
non-trivial to parallelize, unless restricted to a single polar code
of fixed rate and length. The serial nature of this encoding
($O(n \cdot \log n)$ time complexity, where $n$ is the code length)
places a speed limit on the encoding process which gets
worse with increasing code length. To address this, a new
systematic encoding algorithm that is easy to parallelize was
first described in [5]. This algorithm is both parallel and
flexible in code rate. In this work, we extend the flexibility
to code length as well and provide hardware and software
implementations that achieve throughput values of 29 Gbps
and 10 Gbps, respectively.

We dedicate a portion of this work to proving the correctness
of the systematic encoding algorithm presented in [5]. We
prove that it results in valid systematic polar codewords when
the sub-matrix of the encoding matrix with rows and columns
corresponding to information bit indices is an involution.
We prove that this condition is satisfied for both polar and
Reed-Muller codes since they both satisfy a property we call
domination contiguity, which we prove is a sufficient condition
for the involution property to hold.

This paper is organized into two parts addressing flexible
encoding and decoding, respectively. The first part starts with
Section II, where we define some preliminary notation and
contrast the implementation of the original systematic encoder
presented in [4] with that of [5]. Note that reading [4] or
[5] is not a prerequisite to reading the current paper, since
we summarize the key points needed from those papers in
Section II. Section III is mainly about setting notation and
casting the various operations needed in matrix form. In
Section IV, we define the property of domination contiguity
and prove that our algorithm works, in both natural and
bit-reversed modes, if this property is satisfied. The fact
that domination contiguity indeed holds for polar codes is
proved in Section V. With the correctness of the algorithm proved,
flexible hardware and software systematic encoder
implementations are presented in Sections VI and VII.

The second part of this paper deals with flexibility of
decoders with respect to codeword length. Sections VIII and
IX discuss such flexibility with respect to hardware and
software implementations of the state-of-the-art fast simplified
successive-cancellation (Fast-SSC) decoding algorithm,
respectively. The rate- and length-flexible hardware
implementations we present have the same latency and throughput as their
rate-only flexible counterparts and incur only a minor increase
in complexity. The proposed flexible software decoders can
achieve 73% of the throughput of the code-specific decoders.

We would like to mention that some of the proofs presented
in this paper were arrived at independently by Li et al. in
[6] and [7]. Specifically, the result that is most relevant to
our setting in [6] is Theorem 1 thereof as well as the two
corollaries that follow. The closest analog in our paper to
these results is what one can deduce by combining equations
(22), (27), and (28). However, in contrast, our proof is
more general since we do not limit ourselves to constructing
the polar code via the Bhattacharyya parameter. Also, the
results of [6] are not used in that paper for efficient systematic
encoding. A systematic encoder based on these results was
given later in [7], although that encoder is not as amenable
to flexible parallel implementation as the encoder proposed in
[5] and this paper. We also note that Proposition 3 of [7] is
analogous to our Theorem 1, although the proofs are different.

II. Background 

We start by defining what we mean by a “systematic
encoder”, with respect to a general linear code. For integers
$0 < k \le n$, let $G = G_{k \times n}$ denote a $k \times n$ binary matrix with
rank $k$. The notation $G$ is used to denote a generator matrix.
Namely, the code under consideration is

$$ \mathrm{span}(G) = \{ v \cdot G \mid v \in \mathrm{GF}(2)^k \} . $$

An encoder

$$ \mathcal{E} : \mathrm{GF}(2)^k \to \mathrm{span}(G) $$

is a one-to-one function mapping an information bit vector

$$ u = (u_0, u_1, \ldots, u_{k-1}) \in \mathrm{GF}(2)^k $$

to a codeword

$$ x = (x_0, x_1, \ldots, x_{n-1}) \in \mathrm{span}(G) . $$

All the encoders discussed in this paper are linear. Namely,
all can be written in the form

$$ \mathcal{E}(u) = u \cdot \Pi \cdot G , \tag{1} $$

where $\Pi = \Pi_{k \times k}$ is an invertible matrix defined over GF(2).

The encoder $\mathcal{E}$ is systematic if there exists a set of $k$
systematic indices

$$ S = \{s_j\}_{j=0}^{k-1} , \quad 0 \le s_0 < s_1 < \cdots < s_{k-1} \le n-1 , \tag{2} $$

such that restricting $\mathcal{E}(u)$ to the indices $S$ yields $u$.
Specifically, position $s_i$ of $x$ must contain $u_i$. Note that our definition
of “systematic” is stronger than some definitions. That is, apart
from requiring that the information bits be embedded in the
codeword, we further require that the embedding is in the
natural order: $u_i$ appears before $u_j$ if $i < j$.

Since G has rank k, there exist k linearly independent 
columns in G. Thus, we might naively take H as the inverse 
of these columns, take S as the indices corresponding to these 
columns, and state that we are done. Of course, the point
of [4] and [5] is to show that the calculations involved can
be carried out efficiently with respect to the computational
model considered. We now briefly present and discuss these 
two solutions. 


A. The Arikan systematic encoder

Recall [3] that a generator matrix of a polar code is obtained
as follows. We define the Arikan kernel matrix as

$$ F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} . \tag{3} $$

The $m$-th Kronecker power of $F$ is denoted $F^{\otimes m}$ and is
defined recursively as

$$ F^{\otimes m} = \begin{bmatrix} F^{\otimes (m-1)} & 0 \\ F^{\otimes (m-1)} & F^{\otimes (m-1)} \end{bmatrix} , \quad \text{where } F^{\otimes 1} = F . \tag{4} $$

From this point forward, we adopt the shorthand

$$ m = \log_2 n . $$

In order to construct a polar code of length $n = 2^m$, we apply a
bit-reversing operation [3] to the columns of $F^{\otimes m}$. From the
resulting matrix, we erase the $n - k$ rows corresponding to
the frozen indices. The resulting $k \times n$ matrix is the generator
matrix.

A closely related variant is a code for which the column
bit-reversing operation is not carried out. We follow [4] and
present the encoder there in the context of a non-reversed polar
code. Let the complement of the frozen index set be denoted
by

$$ A = \{\alpha_j\}_{j=0}^{k-1} , \quad 0 \le \alpha_0 < \alpha_1 < \cdots < \alpha_{k-1} \le n-1 . \tag{5} $$

The set $A$ is termed the set of active rows.

A simple but key observation is that the matrix $F^{\otimes m}$
is lower triangular with all diagonal entries equal to 1. This is
easily proved by induction using the definition of $F$ and (4).
An immediate corollary is the following. Suppose we start with
$F^{\otimes m}$ and keep only the rows indexed by $A$ (thus obtaining
the generator matrix $G$). From this matrix, we keep only the $k$
columns indexed by $A$. We are left with a $k \times k$ lower triangular
matrix with all diagonal entries equal to 1. In particular, we are
left with an invertible matrix. In the setting of (1) and (2), we
have that $\Pi$ is the inverse of this matrix and the set $S$ of
systematic indices simply equals $A$.
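As an illustration of the observation above, the following minimal Python sketch builds $F^{\otimes m}$ from the recursion in (4) and checks that restricting it to the rows and columns of an active set leaves a lower triangular matrix with a unit diagonal. The helper names and the example set `A` are assumptions made for this sketch; they are not taken from the paper.

```python
def kron_power(m):
    """Return F^{⊗m} over GF(2) as a list of rows, with F = [[1, 0], [1, 1]]."""
    mat = [[1]]  # F^{⊗0} is the 1 x 1 matrix [1]
    for _ in range(m):
        size = len(mat)
        top = [row + [0] * size for row in mat]  # block row [ prev   0   ]
        bottom = [row + row for row in mat]      # block row [ prev  prev ]
        mat = top + bottom
    return mat


def submatrix(mat, rows, cols):
    """Keep only the given rows and columns of mat."""
    return [[mat[r][c] for c in cols] for r in rows]


def is_unit_lower_triangular(mat):
    """True if the diagonal is all ones and everything above it is zero."""
    size = len(mat)
    diagonal_ok = all(mat[i][i] == 1 for i in range(size))
    upper_ok = all(mat[i][j] == 0
                   for i in range(size) for j in range(i + 1, size))
    return diagonal_ok and upper_ok


F3 = kron_power(3)
A = [1, 3, 6, 7]  # hypothetical sorted active set, chosen only for the demo
sub = submatrix(F3, A, A)
print(is_unit_lower_triangular(F3), is_unit_lower_triangular(sub))
```

Any sorted index set gives a unit lower triangular sub-matrix here, which is why the serial encoder described next places no structural requirement on $A$.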

As previously mentioned, the above description is not
enough; we must show an efficient implementation. We now
briefly outline the implementation in [4], which results in an
encoding algorithm running in time $O(n \cdot \log n)$. Let us recall
our goal: we must find a codeword $x = (x_0, x_1, \ldots, x_{n-1})$
such that, using the notation in (5), we have that $x_{\alpha_i}$ equals
$u_i$. Since $x$ is a codeword, it is the result of multiplying
the generator matrix $G$ by some length-$k$ vector from the
left. As mentioned, $G$ is obtained by removing from $F^{\otimes m}$
the rows whose index is not contained in $A$. Thus, we
can alternatively state our goal as follows. We must find a
codeword $x$ as described above such that $x = v \cdot F^{\otimes m}$, where
$v = (v_0, v_1, \ldots, v_{n-1})$ is such that $v_i = 0$ whenever $i \notin A$.

We will now show a recursive implementation of the
systematic encoding function $\mathrm{Encode}_m(u, A)$. Let us start by
considering the stopping condition, $m = 0$. As a preliminary
step, define $F^{\otimes 0} = 1$, a $1 \times 1$ matrix. Note that this definition
is consistent with (3) and (4) for $m = 1$. Next, note that if
$m = 0$, then the problem is trivial: if $A$ is empty then we
are forced to have $v = (0)$, and thus $\mathrm{Encode}_0(u, A)$ returns
$x = (0)$, the all-zero codeword. Otherwise, $A = \{0\}$ and we
simply take $v = (u_0)$ and return $x = (u_0)$, as prescribed.

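Let us now consider the recursion itself. Assume $m \ge 1$
and write $v = (v', v'')$, where $v'$ consists of the first $n/2$
entries of $v$ and $v''$ consists of the last $n/2$ entries of $v$. Let
$x = (x', x'')$ be defined similarly. By the block structure in
(4), we have that

$$ x'' = v'' \cdot F^{\otimes (m-1)} , \tag{6} $$

$$ x' = v' \cdot F^{\otimes (m-1)} + x'' . \tag{7} $$

We will find $x$ by first finding $x''$ and then finding $x'$. Towards
that end, let

$$ A' = \{ \alpha : \alpha \in A \text{ and } \alpha < n/2 \} , $$

$$ A'' = \{ \alpha - n/2 : \alpha \in A \text{ and } \alpha \ge n/2 \} . $$

Finding $x''$ is a straightforward recursive process. Namely,
by (6), if we define $u'' = (u_i)_{\alpha_i - n/2 \in A''}$, then
$x'' = \mathrm{Encode}_{m-1}(u'', A'')$. Now, with $x''$ calculated, we can find
$x'$. Namely, considering (7), we need a $v'$ for which entry
$\alpha_i$ of $v' \cdot F^{\otimes (m-1)}$ equals $u_i + x''_{\alpha_i}$. Thus, defining
$u' = (u_i + x''_{\alpha_i})_{\alpha_i \in A'}$, we have that
$v' \cdot F^{\otimes (m-1)} = \mathrm{Encode}_{m-1}(u', A')$, and $x'$ then follows
from (7) by adding $x''$.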

The main point we want to stress about the above encoder 
is the serial nature of it: first calculate x" and only after that 
is done, calculate x'. 
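The recursion described above can be sketched in Python as follows. This is an illustrative sketch under our reading of (6) and (7), not the authors' implementation; the function name `encode` and the list-based data layout are assumptions. The call that produces $x''$ must complete before the inputs of the second call are known, which is exactly the serial dependency stressed in the text.

```python
def encode(m, u, A):
    """Serial recursive systematic encoder sketch.

    u: list of information bits; A: sorted list of active indices in
    range(2**m), with len(u) == len(A). Returns x of length 2**m such that
    x[A[i]] == u[i] and x = v * F^{⊗m} for some v that is zero outside A.
    """
    if m == 0:
        # Stopping condition: either A = {0} or A is empty.
        return [u[0]] if A else [0]
    half = 1 << (m - 1)
    lower = [(a, bit) for a, bit in zip(A, u) if a < half]           # A'
    upper = [(a - half, bit) for a, bit in zip(A, u) if a >= half]   # A''
    # First find x'' (second half of the codeword), as in (6).
    x2 = encode(m - 1, [bit for _, bit in upper], [a for a, _ in upper])
    # Only now are the targets u'_i = u_i + x''_{alpha_i} known.
    u1 = [bit ^ x2[a] for a, bit in lower]
    y = encode(m - 1, u1, [a for a, _ in lower])   # y = v' * F^{⊗(m-1)}
    x1 = [p ^ q for p, q in zip(y, x2)]            # x' = y + x'', as in (7)
    return x1 + x2


print(encode(2, [1, 0, 1], [0, 1, 3]))
```

For this serial encoder the set `A` may be arbitrary; the information bits appear at the positions in `A`, in order.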

A parallel, higher complexity implementation of the
algorithm in [4], when the frozen bits are set to 0, can calculate
the parity bits directly using matrix multiplication:

$$ x_{A^c} = u \cdot \left( F^{\otimes m}_{AA} \right)^{-1} \cdot F^{\otimes m}_{AA^c} , $$

where $F^{\otimes m}_{AA}$ is a sub-matrix containing rows and columns
of $F^{\otimes m}$ corresponding to information-bit indices. Similarly,
$F^{\otimes m}_{AA^c}$ contains the rows and columns of $F^{\otimes m}$ that
correspond to information and frozen bit indices, respectively. The
dimensions of $F^{\otimes m}_{AA}$ and $F^{\otimes m}_{AA^c}$ change with code rate, in
contrast to our encoder which always uses the fixed $F^{\otimes m}$. A
parallel multiplier that can accommodate matrices of varying
dimension leads to a significant increase in implementation
complexity.

B. The systematic encoder 

We now give a high-level description of the encoder in [5].
As before, we consider a non-reversed setting. Recall that $A$
in (5) is the set of active row indices.

1) We first expand $u = (u_0, u_1, \ldots, u_{k-1})$ into a vector $v_{\mathrm{I}}$
of length $n$ as follows: for all $0 \le i < k$ we set entry
$\alpha_i$ of $v_{\mathrm{I}}$ equal to $u_i$. The remaining $n - k$ entries of $v_{\mathrm{I}}$
are set to 0.

2) We calculate $v_{\mathrm{II}} = v_{\mathrm{I}} \cdot F^{\otimes m}$.

3) The vector $v_{\mathrm{III}}$ is obtained from $v_{\mathrm{II}}$ by setting all entries
not in $A$ to zero.

4) We return $x = v_{\mathrm{III}} \cdot F^{\otimes m}$.

Clearly, steps 1 and 3 can be implemented very efficiently in
any computational model. The interesting part is calculations
of the form $v \cdot F^{\otimes m}$, for a vector $v$ of length $n$.
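The four steps above can be sketched as follows. This is a minimal illustration, not the authors' hardware or software implementation; the function names are assumptions, and the example active set `[3, 5, 6, 7]` for $n = 8$ is a hypothetical choice of a set for which the two-pass scheme is systematic.

```python
def polar_transform(v):
    """Compute v * F^{⊗m} over GF(2) for len(v) = 2**m (recursive sketch)."""
    if len(v) == 1:
        return list(v)
    half = len(v) // 2
    first = polar_transform(v[:half])    # v'  * F^{⊗(m-1)}
    second = polar_transform(v[half:])   # v'' * F^{⊗(m-1)}
    return [p ^ q for p, q in zip(first, second)] + second


def systematic_encode(u, A, n):
    """Two-pass systematic encoder sketch following steps 1) to 4)."""
    v1 = [0] * n
    for a, bit in zip(A, u):
        v1[a] = bit                                  # step 1: expand u
    v2 = polar_transform(v1)                         # step 2: first transform
    active = set(A)
    v3 = [bit if i in active else 0
          for i, bit in enumerate(v2)]               # step 3: zero frozen positions
    return polar_transform(v3)                       # step 4: second transform


A = [3, 5, 6, 7]   # hypothetical active set for n = 8
u = [0, 0, 0, 1]
x = systematic_encode(u, A, 8)
print(x, [x[a] for a in A] == u)
```

The encoder is two applications of the same transform with a masking step in between, so any speed-up of `polar_transform` carries over directly.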

As we will expand on later, the main merit of [5] is that
the computation of $v \cdot F^{\otimes m}$ can be done in parallel. Namely,



Fig. 1. The systematic encoder of 0 for ™ 5) polar code. 


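if $v = (v', v'')$, where $v'$ (respectively, $v''$) equals the first
(respectively, last) $n/2$ entries of $v$, then one can calculate
$v' \cdot F^{\otimes (m-1)}$ and $v'' \cdot F^{\otimes (m-1)}$ concurrently and then, by (4),
combine the results to get

$$ v \cdot F^{\otimes m} = \left( v' \cdot F^{\otimes (m-1)} + v'' \cdot F^{\otimes (m-1)} ,\; v'' \cdot F^{\otimes (m-1)} \right) . $$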
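To make the parallel structure concrete, the sketch below compares a direct recursive evaluation of the split described above with an iterative "butterfly" form. Both function names are assumptions made for this illustration. In the butterfly form, every XOR inside one stage touches a disjoint pair of positions, so a parallel implementation could execute a whole stage at once, giving $\log_2 n$ stages in total.

```python
def transform_recursive(v):
    """v * F^{⊗m} via the split v = (v', v''); the two calls are independent."""
    if len(v) == 1:
        return list(v)
    half = len(v) // 2
    a = transform_recursive(v[:half])   # v'  * F^{⊗(m-1)}
    b = transform_recursive(v[half:])   # v'' * F^{⊗(m-1)}
    return [p ^ q for p, q in zip(a, b)] + b


def transform_butterfly(v):
    """Same product, computed in log2(n) stages of independent XORs."""
    x = list(v)
    n = len(x)
    step = 1
    while step < n:
        for start in range(0, n, 2 * step):
            for i in range(start, start + step):
                x[i] ^= x[i + step]     # pairs within a stage never overlap
        step *= 2
    return x


v = [0, 0, 0, 1, 0, 0, 0, 0]
print(transform_recursive(v), transform_butterfly(v))
```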

We also note that the systematic encoder in [5] is easily
described as two applications of a non-systematic encoder, with
a zeroing operation applied in-between. Thus, any advances
made with respect to non-systematic encoding of polar codes
immediately yield advances in systematic encoding.

Lastly, we state that both the encoder presented in [4] as
well as the one presented in [5] produce the same codeword
when given the same information vector. To see this, note
that on the one hand, both encoders operate with respect to
the same code of dimension $k$. That is, with respect to the
same generator matrix $G$ described above. On the other hand,
by definition, both encoders produce the same output when
restricted to the $k$ systematic indices $A$ of the codeword. That
is, to $k$ indices such that restricting the generator matrix $G$
to them results in a $k \times k$ invertible matrix, as previously
explained. Thus, the error-correction performance of a system
utilizing the same decoder with either encoder remains the
same.


III. Systematic, reversed, and non-reversed codes 


This section is devoted to recasting the concepts and
operations presented in the previous section into matrix terminology.
We start by discussing a general linear code, and then
specialize to both non-reversed and reversed polar codes. Recalling
the definition of $S$ as the set of systematic indices, define the
restriction matrix $R = R_{n \times k}$ corresponding to $S$ as

$$ R = (R_{i,j})_{i=0}^{n-1}{}_{j=0}^{k-1} , \quad \text{where } R_{i,j} = \begin{cases} 1 & \text{if } i = s_j , \\ 0 & \text{otherwise} . \end{cases} \tag{8} $$

With this definition at hand, we require that a systematic
encoder satisfy $\mathcal{E}(u) \cdot R = u$, or equivalently that

$$ \Pi \cdot G \cdot R = I , \tag{9} $$

where $I$ above denotes the $k \times k$ identity matrix. Our proofs
will center on showing that (9) holds.
A. Non-reversed polar codes

In this subsection, we consider a non-reversed polar code.
Recall the definition in (5) of $A$ being the set of active rows,
where the $j$th smallest element of $A$ is denoted $\alpha_j$. For this
case, recall that we define $S$ as equal to $A$ and $s_j$ as equal to
$\alpha_j$.

Define the matrix $E$ as

$$ E = (E_{i,j})_{i=0}^{k-1}{}_{j=0}^{n-1} , \quad \text{where } E_{i,j} = \begin{cases} 1 & \text{if } j = \alpha_i , \\ 0 & \text{otherwise} . \end{cases} \tag{10} $$

The matrix $E$ will be useful in several respects. First, note by
the above that applying $E$ to the left of a matrix with $n$ rows
results in a submatrix containing only the rows indexed by $A$.
Thus, we have that

$$ G_{\mathrm{nrv}} = E \cdot F^{\otimes m} , \tag{11} $$

where $G_{\mathrm{nrv}}$ is the generator matrix of our code, and “nrv” is
short for “non-reversed”.

Next, note that by applying $E$ to the right of a vector $u$
of length $k$, we manufacture a vector $v_{\mathrm{I}}$ such that the entries
indexed by $\alpha_j$ equal $u_j$ and all other entries equal zero. That
is,

$$ v_{\mathrm{I}} = u \cdot E , $$

as per step 1 of the algorithm described in Subsection II-B.
Because of this property, we refer to $E$ as the expanding
matrix.

Let us move on to step 3 of the algorithm. Simple algebra
yields that

$$ v_{\mathrm{III}} = v_{\mathrm{II}} \cdot E^T \cdot E . $$

That is, multiplying a vector of length $n$ from the right by
$E^T \cdot E$ results in a vector in which the entries indexed by $A$
remain the same while the entries not indexed by $A$ are set to
zero.

The above equations yield a succinct description of our
algorithm,

$$ \mathcal{E}_{\mathrm{nrv}}(u) = u \cdot \underbrace{E \cdot F^{\otimes m} \cdot E^T}_{\Pi} \cdot \underbrace{E \cdot F^{\otimes m}}_{G_{\mathrm{nrv}}} . \tag{12} $$

We end this section by noting that by (8) and (10), we have
that

$$ E^T = R . $$

Thus, recalling (9), our aim is to prove that

$$ E \cdot F^{\otimes m} \cdot E^T \cdot E \cdot F^{\otimes m} \cdot E^T = I . \tag{13} $$

Showing this will further imply that the corresponding $\Pi$ in
(12) is indeed invertible.


B. Reversed polar codes 

As explained, we will consider bit-reversed as well as
non-bit-reversed polar codes. Let us introduce corresponding
notation. For an integer $0 \le i < n$, denote the binary
representation of $i$ as

$$ (i)_2 = (i_0, i_1, \ldots, i_{m-1}) , \quad \text{where } i = \sum_{j=0}^{m-1} i_j 2^j \text{ and } i_j \in \{0, 1\} . \tag{14} $$

For $i$ as above, we define $\overleftarrow{i}$ as the integer with reversed binary
representation. That is,

$$ (\overleftarrow{i})_2 = (i_{m-1}, i_{m-2}, \ldots, i_0) , \quad \overleftarrow{i} = \sum_{j=0}^{m-1} i_j 2^{m-1-j} . $$

As in [3], we denote the $n \times n$ bit reversal matrix as
$B_n$. Recall that $B_n$ is a permutation matrix. Specifically,
multiplying a matrix from the left (right) by $B_n$ results in
a matrix in which row (column) $i$ equals row (column) $\overleftarrow{i}$ of
the original matrix.

Recall that we have denoted by $A$ the set of active rows.
We stress that this notation holds for both the reversed as
well as the non-reversed setting. Thus, recalling (10), we have
analogously to (11) that

$$ G_{\mathrm{rv}} = E \cdot F^{\otimes m} \cdot B_n , $$

where $G_{\mathrm{rv}}$ is the generator matrix of our code, and “rv” is
short for “reversed”. By [3, Proposition 16], we know that
$B_n \cdot F^{\otimes m} = F^{\otimes m} \cdot B_n$. Thus, it also holds that

$$ G_{\mathrm{rv}} = E \cdot B_n \cdot F^{\otimes m} . \tag{15} $$
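The bit-reversal notation and the commutation property used above can be checked numerically with a short sketch. The function names are assumptions introduced for this illustration; the permutation matrix $B_n$ is never formed explicitly and is instead applied as an index permutation on rows or columns.

```python
def bit_reverse(i, m):
    """Return the integer whose m-bit binary representation is that of i, reversed."""
    r = 0
    for _ in range(m):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def kron_power(m):
    """F^{⊗m} over GF(2), with F = [[1, 0], [1, 1]]."""
    mat = [[1]]
    for _ in range(m):
        size = len(mat)
        mat = [row + [0] * size for row in mat] + [row + row for row in mat]
    return mat


def commutes_with_bit_reversal(m):
    """Check B_n * F^{⊗m} == F^{⊗m} * B_n using index permutations."""
    n = 1 << m
    F = kron_power(m)
    rev = [bit_reverse(i, m) for i in range(n)]
    left = [F[rev[i]] for i in range(n)]                          # rows permuted
    right = [[F[i][rev[j]] for j in range(n)] for i in range(n)]  # columns permuted
    return left == right


def reversed_active_set(A, m):
    """Apply bit reversal to every element of A and sort the result."""
    return sorted(bit_reverse(a, m) for a in A)


print(commutes_with_bit_reversal(3), reversed_active_set([0, 1, 3], 2))
```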

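In the interest of a lighter notation later on, we now “fold”
the bit-reversing operation into the set $A$. Thus, define the set
of bit-reversed active rows, $\overleftarrow{A}$, obtained from the set of active
rows $A$ by applying the bit-reverse operation on each element
$\alpha_i$. As before, we order the elements of $\overleftarrow{A}$ in increasing order
and denote

$$ \overleftarrow{A} = \{\beta_j\}_{j=0}^{k-1} , \quad 0 \le \beta_0 < \beta_1 < \cdots < \beta_{k-1} \le n-1 . \tag{16} $$

Recall that the expansion matrix $E$ was defined using $A$. We
now define $\overleftarrow{E} = \overleftarrow{E}_{k \times n}$ according to $\overleftarrow{A}$ in exactly the same
way. That is,

$$ \overleftarrow{E} = (\overleftarrow{E}_{i,j})_{i=0}^{k-1}{}_{j=0}^{n-1} , \quad \text{where } \overleftarrow{E}_{i,j} = \begin{cases} 1 & \text{if } j = \beta_i , \\ 0 & \text{otherwise} . \end{cases} \tag{17} $$

Note that $E \cdot B_n$ and $\overleftarrow{E}$ are the same, up to a permutation of
rows (for $i$ fixed, the reverse of $\alpha_i$ does not generally equal
$\beta_i$, hence the need for a permutation). Thus, by (15),

$$ G'_{\mathrm{rv}} = \overleftarrow{E} \cdot F^{\otimes m} \tag{18} $$

is a generator matrix spanning the same code as $G_{\mathrm{rv}}$.
Analogously to (12), our encoder for the reversed code is given
by

$$ \mathcal{E}_{\mathrm{rv}}(u) = u \cdot \underbrace{\overleftarrow{E} \cdot F^{\otimes m} \cdot (\overleftarrow{E})^T}_{\Pi} \cdot \underbrace{\overleftarrow{E} \cdot F^{\otimes m}}_{G'_{\mathrm{rv}}} . \tag{19} $$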

We now highlight the similarities and differences with respect
to the non-reversed encoder. First, note that for the reversed
encoder, the set of systematic indices is $\overleftarrow{A}$, as opposed to $A$ for
the non-reversed encoder. Apart from that, everything remains
the same. Namely, conceptually, we are simply operating the
non-reversed encoder with $\overleftarrow{A}$ in place of $A$. Specifically,
note that as in the non-reversed case, the encoder produces
a codeword such that the information bits are embedded in
the natural order.

Analogously to (13), our aim is to prove that

$$ \overleftarrow{E} \cdot F^{\otimes m} \cdot (\overleftarrow{E})^T \cdot \overleftarrow{E} \cdot F^{\otimes m} \cdot (\overleftarrow{E})^T = I . \tag{20} $$


IV. Domination contiguity implies involution 


In this section we prove that our encoders are valid by
proving that (13) and (20) indeed hold. A square matrix is
called an involution if multiplying the matrix by itself yields
the identity matrix. With this terminology at hand, we must
prove that both $E \cdot F^{\otimes m} \cdot E^T$ and
$\overleftarrow{E} \cdot F^{\otimes m} \cdot (\overleftarrow{E})^T$ are involutions.

Interestingly, and in contrast with the original systematic
encoder presented in [4], the proof of correctness centers on
the structure of $A$. That is, in [4], any set of $k$ active
(non-frozen) channels has a corresponding systematic encoder. In
contrast, consider as an example the case in which $n = 4$ and
$A = \{0, 1, 3\}$. By our definitions,


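$$ E = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} , \quad E^T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} , \quad \text{and} \quad F^{\otimes 2} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 \end{bmatrix} . $$

Thus,

$$ E \cdot F^{\otimes 2} \cdot E^T = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix} $$

and

$$ (E \cdot F^{\otimes 2} \cdot E^T) \cdot (E \cdot F^{\otimes 2} \cdot E^T) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix} . $$

Note that the rightmost matrix above is not an identity matrix.
A similar calculation shows that $\overleftarrow{E} \cdot F^{\otimes 2} \cdot (\overleftarrow{E})^T$ is not an
involution either.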
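The example above is small enough to verify mechanically. The following sketch, with hypothetical helper names, extracts the rows and columns of $F^{\otimes 2}$ indexed by an active set and tests whether the result squares to the identity over GF(2). It reproduces the failure for $\{0, 1, 3\}$ and for its bit-reversed counterpart $\{0, 2, 3\}$, and shows one set, $\{1, 2, 3\}$, for which the test passes.

```python
def kron_power(m):
    """F^{⊗m} over GF(2), with F = [[1, 0], [1, 1]]."""
    mat = [[1]]
    for _ in range(m):
        size = len(mat)
        mat = [row + [0] * size for row in mat] + [row + row for row in mat]
    return mat


def restricted(A, m):
    """Rows and columns of F^{⊗m} indexed by A (the product E * F^{⊗m} * E^T)."""
    F = kron_power(m)
    return [[F[h][j] for j in A] for h in A]


def square_gf2(M):
    """M * M with arithmetic modulo 2."""
    k = len(M)
    return [[sum(M[p][q] & M[q][r] for q in range(k)) % 2 for r in range(k)]
            for p in range(k)]


def is_involution(M):
    k = len(M)
    identity = [[int(p == r) for r in range(k)] for p in range(k)]
    return square_gf2(M) == identity


P = restricted([0, 1, 3], 2)
print(P, square_gf2(P), is_involution(P))
print(is_involution(restricted([0, 2, 3], 2)), is_involution(restricted([1, 2, 3], 2)))
```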

The apparent contradiction to the correctness of our
algorithms will be rectified in the next section. In brief, using
terminology defined in Section V, the fact that $A = \{0, 1, 3\}$
implies that $W^{+-}$ is frozen while $W^{-+}$ is unfrozen. However,
this cannot correspond to a valid polar code since $W^{+-}$ is
upgraded with respect to $W^{-+}$.

We now characterize the $A$ for which (13) and (20) hold.
Recall our notation for binary representation given in (14). For
$0 \le i, j < n$, denote

$$ (i)_2 = (i_0, i_1, \ldots, i_{m-1}) , \quad (j)_2 = (j_0, j_1, \ldots, j_{m-1}) . $$

We define the binary domination relation, denoted $\succeq$, as
follows.

$$ i \succeq j \quad \text{iff for all } 0 \le t < m, \text{ we have } i_t \ge j_t . $$

Namely, $i \succeq j$ iff the support of $(i)_2$ (the indices $t$ for which
$i_t = 1$) contains the support of $(j)_2$.

We say that a set of indices $A \subseteq \{0, 1, \ldots, n-1\}$ is
domination contiguous if for all $h, j \in A$ and for all $0 \le i < n$
such that $h \succeq i$ and $i \succeq j$, it holds that $i \in A$. For easy
reference:

$$ (h, j \in A \text{ and } h \succeq i \succeq j) \implies i \in A . \tag{21} $$

Theorem 1. Let the active rows set $A \subseteq \{0, 1, \ldots, n-1\}$ be
domination contiguous, as defined in (21). Let $E$ and $\overleftarrow{E}$ be
defined according to (5), (10), (16), and (17). Then,
$E \cdot F^{\otimes m} \cdot E^T$ and $\overleftarrow{E} \cdot F^{\otimes m} \cdot (\overleftarrow{E})^T$
are involutions. That is, (13) and (20) hold.

Proof. We first note that for $0 \le i, j < n$, we have that $i \succeq j$
iff $\overleftarrow{i} \succeq \overleftarrow{j}$. Thus, if $A$ is domination contiguous then so is $\overleftarrow{A}$.
As a consequence, proving that $E \cdot F^{\otimes m} \cdot E^T$ is an involution
will immediately imply that $\overleftarrow{E} \cdot F^{\otimes m} \cdot (\overleftarrow{E})^T$ is an involution
as well. Let us prove the former, that is, let us prove (13).

We start by noting a simple characterization of $F^{\otimes m}$, where
$F$ is defined as in (3). Namely, the entry at row $i$ and column
$j$ of $F^{\otimes m}$ is easily calculated:

$$ (F^{\otimes m})_{i,j} = \begin{cases} 1 & \text{if } i \succeq j , \\ 0 & \text{otherwise} . \end{cases} \tag{22} $$

To see this, consider the recursive definition of $F^{\otimes m}$ given
in (4). Obviously, $(F^{\otimes m})_{i,j}$ equals 0 if we are at the upper
right $(n/2) \times (n/2)$ block. That is, if $i_{m-1}$ (the most-significant
bit of $i$) equals 0 and $j_{m-1}$ equals 1. Next, consider
the other three blocks and note that for them, $i \succeq j$ iff
$i \bmod 2^{m-1}$ dominates $j \bmod 2^{m-1}$. Since the remaining
blocks all contain the same matrix, it suffices to prove the
claim for the lower left block. Thus, we continue recursively
with $i \bmod 2^{m-1}$ and $j \bmod 2^{m-1}$.

Recalling (5) and the fact that $|A| = k$, we adopt the
following shorthand: for $0 \le p, q, r < k$ given, let

$$ h = \alpha_p , \quad i = \alpha_q , \quad j = \alpha_r . $$

By the above, a straightforward derivation yields that

$$ (E \cdot F^{\otimes m} \cdot E^T)_{p,q} = (F^{\otimes m})_{h,i} \quad \text{and} \quad (E \cdot F^{\otimes m} \cdot E^T)_{q,r} = (F^{\otimes m})_{i,j} . $$

Thus,

$$ \left( (E \cdot F^{\otimes m} \cdot E^T) \cdot (E \cdot F^{\otimes m} \cdot E^T) \right)_{p,r} = \sum_{q=0}^{k-1} (E \cdot F^{\otimes m} \cdot E^T)_{p,q} \cdot (E \cdot F^{\otimes m} \cdot E^T)_{q,r} = \sum_{i \in A} (F^{\otimes m})_{h,i} \cdot (F^{\otimes m})_{i,j} . \tag{23} $$

Proving (13) is now equivalent to proving that the right-hand
side of (23) equals 1 iff $h$ equals $j$. Recalling (22), this is
equivalent to showing that if $h \ne j$, then there is an even
number of $i \in A$ for which

$$ h \succeq i \quad \text{and} \quad i \succeq j , \tag{24} $$

while if $h = j$, then there is an odd number of such $i$.

We distinguish between 3 cases.

1) If $h = j$, then there is a single $0 \le i < n$ for which
(24) holds. Namely, $i = h = j$. Since $h, j \in A$, we have
that $i \in A$ as well. Since 1 is odd, we are finished with
the first case.

2) If $h \ne j$ and $h \not\succeq j$, then there can be no $i$ for which
(24) holds. Since 0 is an even integer, we are done with
this case as well.

3) If $h \ne j$ and $h \succeq j$, then the support of the binary
vector $(j)_2 = (j_0, j_1, \ldots, j_{m-1})$ is contained in and
distinct from the support of the binary vector
$(h)_2 = (h_0, h_1, \ldots, h_{m-1})$. A moment of thought reveals that
the number of $0 \le i < n$ for which (24) holds is
equal to $2^{w(h) - w(j)}$, where $w(h)$ and $w(j)$ represent
the support size of $(h)_2$ and $(j)_2$, respectively. Since
$h \ne j$ and $h \succeq j$, we have that $w(h) - w(j) > 0$. Thus,
$2^{w(h) - w(j)}$ is even. Since $h, j \in A$ and $A$ is domination
contiguous, all of the above mentioned $i$ are members
of $A$. To sum up, an even number of $i \in A$ satisfy (24),
as required. ■
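The combinatorial facts used in the proof are easy to test for small $n$. In the sketch below, which uses hypothetical function names, binary domination is a bitwise test, domination contiguity is checked by brute force, and the number of indices sandwiched between $h$ and $j$ is counted directly so it can be compared with the three cases above.

```python
def dominates(i, j):
    """True iff the binary support of j is contained in that of i."""
    return (i & j) == j


def is_domination_contiguous(A, n):
    """Brute-force check of the defining property of domination contiguity."""
    members = set(A)
    return all(i in members
               for h in members for j in members for i in range(n)
               if dominates(h, i) and dominates(i, j))


def between(h, j, n):
    """All 0 <= i < n with h dominating i and i dominating j."""
    return [i for i in range(n) if dominates(h, i) and dominates(i, j)]


print(is_domination_contiguous([0, 1, 3], 4))      # the failing example set
print(is_domination_contiguous([3, 5, 6, 7], 8))   # a weight-at-least-2 set
print(between(7, 1, 8), between(5, 2, 8), between(6, 6, 8))
```

For `h = 7` and `j = 1` with `n = 8`, the supports have sizes 3 and 1, so the count is expected to be $2^{3-1} = 4$, an even number, in line with case 3.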

Recall m Section X] that an (r, m) Reed-Muller code has 
length $n = 2^m$ and is formed by taking the set $A$ to contain
all indices $i$ such that the support of $(i)_2$ has size at least
$r$. Clearly, such an $A$ is domination contiguous, as defined
in (21). Hence, the following is an immediate corollary of
Theorem 1 and states that our encoders are valid for
Reed-Muller codes.

Corollary 2. Let the active row set $A$ correspond to an $(r, m)$
Reed-Muller code. Let $E$ and $\overleftarrow{E}$ be defined according to (10),
(16), and (17), where $n = 2^m$. Then, $E \cdot F^{\otimes m} \cdot E^T$ and
$\overleftarrow{E} \cdot F^{\otimes m} \cdot (\overleftarrow{E})^T$ are involutions. That is, (13) and (20) hold
and thus our two encoders are valid.
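As a numerical sanity check of the corollary, the sketch below builds the active set exactly as described in the text (indices whose binary support has size at least $r$) and tests the involution property for $n = 16$. The function names are assumptions, and matrix entries are generated with the bitwise domination test, relying on the characterization of $F^{\otimes m}$ by binary domination.

```python
def reed_muller_active_set(r, m):
    """Indices in range(2**m) whose binary support has size at least r."""
    return [i for i in range(1 << m) if bin(i).count("1") >= r]


def restricted(A):
    """Rows and columns of F^{⊗m} indexed by A; entry (h, j) is 1 iff h dominates j."""
    return [[1 if (h & j) == j else 0 for j in A] for h in A]


def is_involution(M):
    """True iff M * M is the identity over GF(2)."""
    k = len(M)
    return all(
        sum(M[p][q] & M[q][r] for q in range(k)) % 2 == int(p == r)
        for p in range(k) for r in range(k))


for r in range(5):
    A = reed_muller_active_set(r, 4)
    print(r, len(A), is_involution(restricted(A)))
```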

V. Polar codes satisfy domination contiguity 

The previous section concluded with proving that our
encoders are valid for Reed-Muller codes. Our aim in this section
is to prove that our encoders are valid for polar codes. In
order to do so, we first define the concept of a (stochastically)
upgraded channel.

A channel $W$ with input alphabet $\mathcal{X}$ and output alphabet
$\mathcal{Y}$ is denoted $W : \mathcal{X} \to \mathcal{Y}$. The probability of receiving
$y \in \mathcal{Y}$ given that $x \in \mathcal{X}$ was transmitted is denoted $W(y|x)$.
Our channels will be binary input, memoryless, and output
symmetric (BMS). Binary: the channel input alphabet will
be denoted as $\mathcal{X} = \{0, 1\}$. Memoryless: the probability of
receiving the vector $(y_i)_{i=0}^{n-1}$ given that the vector $(x_i)_{i=0}^{n-1}$
was transmitted is $\prod_{i=0}^{n-1} W(y_i|x_i)$. Symmetric: there exists
a permutation $\pi : \mathcal{Y} \to \mathcal{Y}$ such that for all $y \in \mathcal{Y}$,
$\pi(\pi(y)) = y$ and $W(y|0) = W(\pi(y)|1)$.

We say that a channel $W : \mathcal{X} \to \mathcal{Y}$ is upgraded with respect
to a channel $Q : \mathcal{X} \to \mathcal{Z}$ if there exists a channel $\Phi : \mathcal{Y} \to \mathcal{Z}$
such that concatenating $\Phi$ to $W$ results in $Q$. Formally, for all
$x \in \mathcal{X}$ and $z \in \mathcal{Z}$,

$$ Q(z|x) = \sum_{y \in \mathcal{Y}} W(y|x) \cdot \Phi(z|y) . $$

We denote $W$ being upgraded with respect to $Q$ as $W \succeq Q$.
As we will soon see, using the same notation for upgraded
channels and binary domination is helpful.


Let $W : \mathcal{X} \to \mathcal{Y}$ be a BMS channel. Let $W^- : \mathcal{X} \to \mathcal{Y}^2$
and $W^+ : \mathcal{X} \to \mathcal{Y}^2 \times \mathcal{X}$ be the “minus” and “plus” transforms
as defined in [3]. That is,

$$ W^-(y_0, y_1 | u_0) = \frac{1}{2} \sum_{u_1 \in \{0,1\}} W(y_0 | u_0 + u_1) \cdot W(y_1 | u_1) , $$

$$ W^+(y_0, y_1, u_0 | u_1) = \frac{1}{2} W(y_0 | u_0 + u_1) \cdot W(y_1 | u_1) . $$

The claim in the following lemma seems to be well known in
the community, and is very easy to prove. Still, since we have
not found a place in which the proof is stated explicitly, we
supply it as well.

Lemma 3. Let $W : \mathcal{X} \to \mathcal{Y}$ be a BMS channel. Then, $W^+$ is
upgraded with respect to $W^-$,

$$ W^+ \succeq W^- . \tag{25} $$

Proof. We prove that $W^+ \succeq W$ and $W \succeq W^-$. Since “$\succeq$”
is easily seen to be a transitive relation, the proof follows. To
show that $W^+ \succeq W$, take $\Phi : \mathcal{Y}^2 \times \mathcal{X} \to \mathcal{Y}$ as the channel
which maps $(y_0, y_1, u_0)$ to $y_1$ with probability 1. We now
show that $W \succeq W^-$. Recalling that $W$ is a BMS channel, we denote
the corresponding permutation as $\pi$. We also denote by $\delta(\cdot)$
a function taking as an argument a condition. The function
$\delta$ equals 1 if the condition is satisfied and 0 otherwise. With
these definitions at hand, we take

$$ \Phi(y_0, y_1 | y) = \frac{1}{2} \left[ W(y_1|0) \cdot \delta(y_0 = y) + W(y_1|1) \cdot \delta(y_0 = \pi(y)) \right] . \quad ■ $$

This is a good place to note that our algorithm is applicable 
to a slightly more general setting. Namely, the setting of com¬ 
pound polar codes as presented in [HI- The slight alterations 
needed are left to the reader. 

The following lemma claims that both polar transformations 
preserve the upgradation relation. It is a restatement of ||9] 
Lemma 4.7]. 

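Lemma 4. Let $W : \mathcal{X} \to \mathcal{Y}$ and $Q : \mathcal{X} \to \mathcal{Z}$ be two BMS
channels such that $W \succeq Q$. Then,

$$ W^- \succeq Q^- \quad \text{and} \quad W^+ \succeq Q^+ . \tag{26} $$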

For a BMS channel $W$ and $0 \le i < n$, denote by $W_i^{(m)}$ the
corresponding bit-channel defined in [3]. By [3, Proposition
13], the channel $W_i^{(m)}$ is symmetric. The following lemma
ties the two definitions of the $\succeq$ relation.

Lemma 5. Let $W : \mathcal{X} \to \mathcal{Y}$ be a BMS channel. Let the
indices $0 \le i, j < n$ be given. Then, binary domination implies
upgradation. That is,

$$ i \succeq j \implies W_i^{(m)} \succeq W_j^{(m)} . \tag{27} $$

Proof. We prove the claim by induction on $m$. For $m = 1$,
the claim follows from either (25), or the fact that a channel
is upgraded with respect to itself, depending on the case. For
$m > 1$, we have by induction that

$$ W_{\lfloor i/2 \rfloor}^{(m-1)} \succeq W_{\lfloor j/2 \rfloor}^{(m-1)} . $$

Now, if the least significant bits of $i$ and $j$ are the same we
use (26), while if they differ we additionally use (25) and the
transitivity of the $\succeq$ relation. ■
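To see the lemma at work on a concrete channel, the sketch below uses the binary erasure channel (BEC) as an illustrative assumption, since for it the Bhattacharyya parameter of the "minus" and "plus" transforms has the well-known closed forms $2z - z^2$ and $z^2$. Indices are built so that the least significant bit selects the last transform, matching the induction step of the proof. The function names are hypothetical, and exact rational arithmetic is used to avoid rounding effects.

```python
from fractions import Fraction


def bec_bhattacharyya(m, eps):
    """Bhattacharyya parameters of the 2**m bit-channels of a BEC(eps).

    Index 2*i is the 'minus' transform of bit-channel i at the previous
    level and index 2*i + 1 is its 'plus' transform.
    """
    z = [Fraction(eps)]
    for _ in range(m):
        nxt = []
        for zi in z:
            nxt.append(2 * zi - zi * zi)  # minus transform: less reliable
            nxt.append(zi * zi)           # plus transform: more reliable
        z = nxt
    return z


def domination_implies_no_worse(z):
    """Check that i dominating j implies Z_i <= Z_j for all index pairs."""
    n = len(z)
    return all(z[i] <= z[j]
               for i in range(n) for j in range(n) if (i & j) == j)


z4 = bec_bhattacharyya(2, Fraction(1, 2))
print(z4, domination_implies_no_worse(bec_bhattacharyya(4, Fraction(1, 2))))
```

For $n = 4$ the sketch also shows that index 2 has a smaller parameter than index 1, even though neither index dominates the other.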

We are now ready to prove our second main result. 

Theorem 6. Let A be the active rows set corresponding to a 
polar code. Then, A is domination contiguous. 

Proof. We must first state exactly what we mean by a “polar
code”. Let the code dimension $k$ be specified. In [3], $A$
equals the indices corresponding to the $k$ channels $W_i^{(m)}$ with
smallest Bhattacharyya parameter, where $0 \le i < n$. Other
definitions are possible and will be discussed shortly. However,
for now, let us use the above definition.

Denote the Bhattacharyya parameter of a channel $W$ by
$Z(W)$. As is well known, if $W$ and $Q$ are two BMS channels,
then

$$ W \succeq Q \implies Z(W) \le Z(Q) . \tag{28} $$

For a proof of this fact, see Eol. 

We deduce from (27) and (28) that if $i \succeq j$, then
$Z(W_i^{(m)}) \le Z(W_j^{(m)})$. Assume for a moment that the
inequality is always strict when $i \succeq j$ and $i \ne j$. Under
this assumption, $j \in A$ must imply $i \in A$. This is a stronger
claim than (21), which is the definition of $A$ being domination
contiguous. Thus, under this assumption we are done.

The previous assumption is in fact true for all relevant
cases, but somewhat misleading: the set $A$ is constructed by
algorithms calculating with finite precision. It could be the
case that $i \ne j$, $i \succeq j$, but $Z(W_i^{(m)})$ and $Z(W_j^{(m)})$ are
approximated by the same number (a tie), or by two close
numbers, but in the wrong order. Thus, it might conceivably
be the case that $j$ is a member of $A$ while $i$ is not (in practice,
we have never observed this to happen). These cases are easy
to check and fix, simply by removing $j$ from $A$ and inserting
$i$ instead. Note that each such operation enlarges the total
Hamming weight of the vectors $(t)_2$ corresponding to elements
$t$ of $A$. Thus, such a swap operation will terminate in at most a
finite number of steps. When the process terminates, we have
by definition that if $j \in A$ and $i \succeq j$, then $i \in A$. Thus, $A$ is
domination contiguous.

Instead of taking the Bhattacharyya parameter as the figure 
of merit, we could have instead used the (more natural) 
channel misdecoding probability. That is, the probability of an 
incorrect maximum-likelihood estimation of the input to the 
channel given the channel output, assuming a uniform input 
distribution. Yet another figure of merit we could have taken 
is the channel capacity. The important point in the proof was 
that an upgraded channel has a figure of merit value that is no 
worse. This holds true for the other two options discussed in 
this paragraph. See mu Lemma 3] for details and references. 
We note that although these are natural ways of defining polar 
codes, they are in some cases sub-optimal. See for example 
m (it is easily proved that our encoder is valid for the scheme 
in ma as well). ■ 
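The construction and the swap-based fix discussed in the proof can be sketched as follows. The binary erasure channel and its closed-form Bhattacharyya recursion are assumptions made purely for illustration, as are all function names; an actual construction would use whichever figure of merit and channel model the application calls for.

```python
from fractions import Fraction


def bec_bhattacharyya(m, eps):
    """Bhattacharyya parameters of the 2**m bit-channels of a BEC(eps)."""
    z = [Fraction(eps)]
    for _ in range(m):
        z = [val for zi in z for val in (2 * zi - zi * zi, zi * zi)]
    return z


def dominates(i, j):
    return (i & j) == j


def construct_active_set(m, k, eps):
    """Pick the k indices with the smallest Bhattacharyya parameter."""
    z = bec_bhattacharyya(m, eps)
    order = sorted(range(1 << m), key=lambda i: z[i])
    return sorted(order[:k])


def is_upward_closed(A, n):
    """True iff j in A and i dominating j always implies i in A."""
    members = set(A)
    return all(i in members
               for j in members for i in range(n) if dominates(i, j))


def repair(A, n):
    """Swap j out for a missing dominating index i until none remains."""
    members = set(A)
    changed = True
    while changed:
        changed = False
        for j in sorted(members):
            missing = [i for i in range(n)
                       if i not in members and dominates(i, j)]
            if missing:
                members.remove(j)
                members.add(missing[0])  # total Hamming weight grows
                changed = True
                break
    return sorted(members)


A = construct_active_set(4, 8, Fraction(1, 2))
print(A, is_upward_closed(A, 16), repair([0, 1, 3], 4))
```

With exact arithmetic the constructed set already passes the check, so `repair` leaves it unchanged; the repair step only matters when finite-precision ties or misorderings occur.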

A. Application to Shortened Codes 

We end this section by discussing two shortening
procedures, [13] and [14], which are compatible with our
systematic encoder. Recall that shortening a code at positions
$\Gamma \subseteq \{0, 1, \ldots, n-1\}$ means that only codewords
$x = (x_0, x_1, \ldots, x_{n-1})$ for which $\gamma \in \Gamma$ implies $x_\gamma = 0$ are part of
the newly created code. Since the value of $x$ at positions $\Gamma$ is
known to be 0, these codeword positions are not transmitted.
Hence, a word of length $n - |\Gamma|$ is transmitted over the channel.

For our purposes, the polar shortening schemes [?] and [?] are very similar. In [?], the set $\Gamma$ is defined as

$$\Gamma = \{\gamma : \gamma_0 \le \gamma < n\}$$

for a given $\gamma_0$. The encoding in [?] is of the non-bit-reversed type. In contrast, the set $\Gamma$ in [?] is obtained from the set $\Gamma$ above by applying a bit-reversing operation. The encoding in [?] is bit-reversed as well.

An important consequence of the above definition of $\Gamma$ is the following. In both settings, the shortening is accomplished by freezing the corresponding indices in the information vector. That is, $\gamma \in \Gamma$ implies that $\gamma \notin A$. Also, in both settings, the “channel” corresponding to a position $\gamma \in \Gamma$ (which is not transmitted) is taken as the noiseless channel when constructing the polar code. The rationale is that we know with certainty that the value of $x_\gamma$ is 0.
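
The claim that frozen shortened indices force the corresponding codeword positions to zero can be checked numerically for the non-bit-reversed case. The sketch below is only an illustration under stated assumptions: `polar_transform` is a plain bit-list model of the non-systematic transform, and the values of `n` and `gamma0` are hypothetical.

```python
import random


def polar_transform(u):
    """Non-bit-reversed polar transform over GF(2), one stage per distance d."""
    x = list(u)
    n = len(x)
    d = 1
    while d < n:
        for t in range(n):
            if (t // d) % 2 == 0:
                x[t] ^= x[t + d]  # lower index of each pair receives the XOR
        d *= 2
    return x


n, gamma0 = 16, 11  # hypothetical code length and first shortened index
rng = random.Random(0)
# Indices gamma0..n-1 are frozen to 0; the remaining inputs are arbitrary.
u = [rng.randint(0, 1) if t < gamma0 else 0 for t in range(n)]
x = polar_transform(u)
shortened_part = x[gamma0:]  # expected to be all zeros
```

Position $j$ of the output only depends on inputs whose index dominates $j$, and such indices are never smaller than $j$, which is why the tail of `x` stays zero.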

The applicability of our method to the above follows by 
two simple observations, which we now state without proof. 
Firstly, Lemma [U and its derivatives continue to hold in the 
setting in which the underlying channels may be of a different 
type. Specifically, note that at the lowest level, a plus or minus 
operation may involve a “real” channel and a “noiseless” 
channel. However, the natural analog of (l25T l continues to hold. 
Namely, a plus operation is still upgraded with respect to a 
minus operation. 

The second observation is that i A j as well as i A j imply 
that i > j. Thus, in this setting as well, domination contiguity 
continues to hold. Indeed, consider for concreteness the non-reversed case and suppose to the contrary that domination contiguity does not hold. Namely, we have found $h \succeq i \succeq j$ such that $h, j \in A$ but $i \notin A$. By our first observation, $i \notin A$ must be the result of $i$ being a shortened index, $i \in \Gamma$. But if $i$ is a shortened index, we have by our second observation that $h$ is a shortened index as well. Hence, $h \in \Gamma$, which implies that $h \notin A$, a contradiction.

VI. Flexible Hardware Encoders 

The encoder discussed in the previous sections uses two 
instances—or two passes—of a non-systematic polar encoder 
to calculate the systematic codeword. Therefore it is important 
to have a suitable non-systematic encoder that provides its 
output in natural or bit-reversed order. 

A semi-parallel non-systematic polar encoder design with a throughput of $P$ bit/cycle, where $P$ corresponds to the level of parallelism, was presented in [?]. However, it presents its output in pair bit-reversed order (the output is in bit-reversed order if a pair of consecutive bits is viewed as a single entity), rendering it unsuitable for use with our systematic encoder. This also poses a problem for parallel and semi-parallel decoders, which expect their input either in natural or in bit-reversed order.

We start this section by presenting the architecture for a new non-systematic encoder that presents its output in natural order and has the same $P$-bit/cycle throughput and $n/P$-cycle latency as [?]. We show the impact of adding length flexibility support, and then utilize it as the core component of a flexible systematic encoder according to the algorithm discussed in this work.

Fig. 2. Architecture of the proposed semi-parallel non-systematic polar encoder with n = 16 and P = 4.

A. Non-Systematic Encoder Architecture 

Fig. 2 shows the proposed architecture for a non-systematic encoder with $n = 16$ and $P = 4$, where stage boundaries are indicated using dashed lines. Each stage $S_i$, with index $i$, applies the basic polar transformation to two input bits, $\beta_{i-1}[j]$ and $\beta_{i-1}[j + 2^{i-1}]$, that are $2^{i-1}$ bits apart in the polar code graph. Since the input pairs to stages with indices $i \in [1, \log P]$ are available in the same $P$-bit input and the same clock cycle, these stages are implemented using combinational logic only, as shown for $S_1$ and $S_2$ in the figure. On the other hand, the two bits processed simultaneously by a stage with an index $i > \log P$ are not available in the same clock cycle, necessitating the use of delay elements, denoted D in Fig. 2. Such a stage is implemented using $P$ 1-bit processing elements operating in parallel, each of which has $2^{i-1}/P$ delay elements. A processing element $l$ contains a multiplexer that alternates its output between $\beta_{i-1}[Pt + l - 2^{i-1}] \oplus \beta_{i-1}[Pt + l]$ and $\beta_{i-1}[Pt + l]$ every $2^{i-1}/P$ clock cycles, where $t$ is the current cycle index.

The resulting encoder has a throughput of $P$ bit/cycle, a latency of $n/P$ cycles, and a critical path that passes from $u[4t+4]$ to $x[4t]$, similar to the encoder of [?]. The critical path can be shortened by inserting pipeline registers at stage boundaries, increasing latency in terms of cycles, but leaving throughput per cycle unaffected. In addition to the output order, the proposed architecture has another advantage over [?] in that it can be used to implement a fully serial encoder with $P = 1$, whereas that of [?] can only scale down to $P = 2$. We note that throughout this work, encoding latency is measured from first data-in to first data-out and all encoders start their operation as soon as the first $P$ input bits are available.

B. Flexible Non-Systematic Encoder 

Since the input $u$ is assumed to contain ‘0’s in the frozen bit locations, the proposed encoder is rate flexible, as the input preprocessor can change the location and number of frozen bits without affecting the encoder architecture.

Fig. 3. Flexible encoder with maximum code length $n_{\max}$ and parallelism $P$.

Adapting this architecture to encode any polar code of length $n \le n_{\max}$ requires extracting data from different stage outputs (indicated using the dashed lines in Fig. 2) in the encoder. The output for a code of length $n$ can be extracted from location $S_{\log n}$ using $P$ instances of a $\log n_{\max} \times 1$ multiplexer.

The width of the multiplexer can be reduced to $\log n_{\max} - \log P$ without affecting encoding latency by exploiting the combinational nature of stages $[S_1, S_{\log P}]$ and setting inputs with indices $i > n - 1$ to ‘0’. The modified encoder architecture is illustrated in Fig. 3, where the block labeled ‘encoder’ is the non-systematic encoder shown in Fig. 2. The AND gates are used to mask inputs when $n < P$. The first AND gate, $\&_0$, will always have its second input set to ‘1’ in this case and will be optimized away by the synthesis tool. It is shown in the figure because it will be used to implement code shortening as described in Section VI-F.
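
The effect of setting the inputs with indices above $n - 1$ to ‘0’ can be illustrated with a bit-level model. The sketch below assumes a plain software model of the transform (`polar_transform` is a hypothetical helper, not the hardware description) and uses arbitrary example values; it shows that the first $n$ outputs of the longer transform coincide with the length-$n$ codeword, so later stages leave them unchanged.

```python
def polar_transform(u):
    """Natural-order polar transform over GF(2) on a list of bits."""
    x = list(u)
    n = len(x)
    d = 1
    while d < n:
        for t in range(n):
            if (t // d) % 2 == 0:
                x[t] ^= x[t + d]
        d *= 2
    return x


n_max, n = 16, 4              # hypothetical maximum and actual code lengths
u = [1, 0, 1, 1]              # length-n input with frozen bits already set
padded = u + [0] * (n_max - n)  # inputs with index > n - 1 masked to 0

short = polar_transform(u)       # dedicated length-n encoder
full = polar_transform(padded)   # length-n_max encoder with masked inputs
```

Here `full[:n]` equals `short` and the remaining outputs are zero, which is the property that allows the output to be taken from a later stage.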


C. Non-Systematic Encoder Implementation 

Both the rate-flexible and the rate- and length-flexible versions of the proposed non-systematic encoder were implemented on the Altera Stratix IV EP4SGX530KH40C2 FPGA. We also implemented the encoder of [?] on the same FPGA for comparison, even though its output order is not suitable for implementing the proposed systematic encoder. All encoders have a latency of 512 cycles and a coded throughput of 32 bits/cycle for $n_{\max} = 16384$ and $P = 32$. Table I presents these implementation results, where the proposed rate-flexible and the rate- and length-flexible encoders are denoted R-flexible and Rn-flexible, respectively. From the table it can be observed that including length flexibility increases the logic requirements of the design by 27% due to the extra routing required. It also decreases the maximum achievable frequency, and in turn throughput, by 14%. The latency of the encoders is bounded by $n_{\max}/P$, and increasing $P$ to 64 reduced it to 0.74 and 0.82 μs, increasing the throughput to 22 and 20 Gbps, for the R- and Rn-flexible encoders, respectively. The maximum achievable frequency decreased to 344 and 313 MHz for the two encoders.
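
The relation between parallelism, clock frequency, latency, and throughput quoted above can be reproduced with a few lines of arithmetic. The helper below is a hypothetical convenience function, assuming a latency of $n_{\max}/P$ cycles and a throughput of $P$ bits per cycle as stated in the text.

```python
def encoder_figures(n_max, p, f_mhz):
    """Return (latency in cycles, latency in microseconds, throughput in Gbps)."""
    cycles = n_max // p
    latency_us = cycles / f_mhz          # cycles / (cycles per microsecond)
    throughput_gbps = p * f_mhz / 1000.0  # bits per cycle * MHz -> Gbps
    return cycles, latency_us, throughput_gbps


# R-flexible encoder with P = 32 at 394 MHz.
r_flex_32 = encoder_figures(16384, 32, 394)
# R-flexible encoder with P = 64 at 344 MHz.
r_flex_64 = encoder_figures(16384, 64, 344)
```

With these inputs the helper gives 512 cycles, about 1.3 μs, and about 12.6 Gbps for $P = 32$, and about 0.74 μs and 22 Gbps for $P = 64$.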
TABLE I

Implementation of the proposed R-flexible and Rn-flexible non-systematic encoders compared with that of [?] for $n_{\max} = 16384$ and $P = 32$ on the Altera Stratix IV EP4SGX530KH40C2.

Encoder       LUTs   FF     RAM (bits)  f (MHz)  Lat. (μs)  T/P (Gbps)
[?]           769    1,392  12,160      354      1.4        11.3
R-flexible    649    1,240  12,160      394      1.3        12.6
Rn-flexible   838    1,293  12,160      360      1.4        11.5

The results were obtained using Altera Quartus II 15.0
and verified using both RTL and gate-level simulation with 
randomized testbenches. 


D. Systematic Encoder Architecture 

With a non-systematic encoder providing its output in a suitable order, we now present the architecture and implementation results for the proposed systematic encoder. As proved in Sections III and IV, the proposed systematic encoder can present its output with parity bits in bit-reversed or natural order locations (even with the non-systematic encoder providing its output in natural order) by changing the location of the frozen bits. We therefore use an $(n_{\max}/P) \times P$-bit memory to store the frozen-bit mask, enabling the encoder to support both parity-bit locations, in addition to rate flexibility.

As mentioned in Section III-B, the systematic-encoding process performs two non-systematic encoding passes on the data. These passes can be implemented using two instances of the proposed non-systematic encoder. The output of the first is stored in registers and then masked according to the content of the mask memory before being passed to another level of pipeline registers to limit the critical path length. The output of the registers is then passed to the second non-systematic encoder instance, whose output forms the systematic codeword. Such an architecture has the same $P$-bit/cycle throughput as the component non-systematic encoder, with a latency of $\mathcal{L} = 2\mathcal{L}_{\mathrm{ns}} + 2$ cycles, where $\mathcal{L}_{\mathrm{ns}}$ is the latency of the non-systematic encoder.
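
The two-pass data flow (encode, mask with the frozen-bit mask, encode again) can be modeled at the bit level as follows. This is a sketch under assumptions: the helper names are hypothetical, the transform is the natural-order one, and the information set used in the example is an illustrative domination-contiguous choice for $n = 8$.

```python
def polar_transform(u):
    """Natural-order polar transform over GF(2) on a list of bits."""
    x = list(u)
    n = len(x)
    d = 1
    while d < n:
        for t in range(n):
            if (t // d) % 2 == 0:
                x[t] ^= x[t + d]
        d *= 2
    return x


def systematic_encode(info_bits, info_set, n):
    """Two non-systematic passes with a frozen-bit mask in between."""
    v = [0] * n
    for pos, bit in zip(sorted(info_set), info_bits):
        v[pos] = bit                    # information bits at their final positions
    u = polar_transform(v)              # first pass
    u = [b if t in info_set else 0 for t, b in enumerate(u)]  # mask frozen bits
    return polar_transform(u)           # second pass gives the codeword


A = {3, 5, 6, 7}  # assumed information set, chosen to be domination contiguous
x = systematic_encode([1, 0, 1, 1], A, 8)
info_readback = [x[p] for p in sorted(A)]
```

Reading the codeword at the positions in `A` returns the information bits, and applying the transform to `x` once more yields zeros at every frozen position, so `x` is a valid codeword.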

Alternatively, to save implementation resources at the cost 
of halving the throughput, one instance of the component 
encoder can be used for both passes. The output of the non- 
systematic encoder is stored in registers after the first pass 
and is routed back to the input of the encoder. The systematic 
codeword becomes available after the second pass. 

The systematic encoder of [?] can be used in a configuration similar to the proposed high-throughput one. However, it requires multiplication by matrices that change when the frozen bits are changed. Therefore, its implementation requires a configurable parallel matrix multiplier that is significantly more complex than the component non-systematic encoder used in this work. When the encoder of [?] is implemented to be rate-flexible and low-complexity, it has a latency of at least $n$ clock cycles, compared to the $2n/P + 2$ cycle latency of the proposed architecture.


TABLE II

Implementation of the proposed R-flexible and Rn-flexible systematic encoders for $n_{\max} = 16384$ and $P = 32$ on the Altera Stratix IV EP4SGX530KH40C2.

Encoder                      LUTs   FF     RAM (bits)  f (MHz)  Lat. (μs)  T/P (Gbps)
Non-pipelined  R-flexible    1,442  2,320  36,924      206      5.0        6.6
               Rn-flexible   1,782  2,381  36,924      180      5.7        5.7
Pipelined      R-flexible    1,397  2,639  36,924      282      3.6        9.0
               Rn-flexible   1,606  2,742  36,924      264      3.9        8.4


TABLE III

Implementation of the proposed Rn-flexible systematic encoder for different $n_{\max}$ and $P$ values on the Altera Stratix IV EP4SGX530KH40C2.

n_max   P    LUTs   FF      RAM (bits)  f (MHz)  Lat. (μs)  T/P (Gbps)
16,384  32   1,606  2,742   36,924      264      3.9        8.4
16,384  64   2,872  5,287   16,384      235      2.2        15.0
16,384  128  4,404  8,304   16,384      272      0.9        34.8
32,768  32   1,971  2,997   85,948      258      7.9        8.2
32,768  64   3,390  5,601   64,200      265      3.9        16.9
32,768  128  5,550  10,024  37,304      234      2.2        29.9


E. Systematic Encoder Implementation 

Implementation results of the throughput-oriented R-flexible and Rn-flexible encoders are presented in Table II, both with and without pipeline registers in between the two non-systematic encoder instances. In the pipelined version, two levels were used: one before and one after the masking operation, since memory access incurred a comparatively long delay. The results show that the pipelined version performs significantly faster than the non-pipelined version, where the clock frequency was increased by 80 MHz for both pipelined encoders and was limited by clock and asynchronous reset distribution. The pipelining yielded throughput values of 9 and 8.4 Gbps for the R-flexible and Rn-flexible encoders, respectively. The reported amount of RAM included the mask memory, in addition to operations that were converted automatically by the synthesis and mapping tools.

As in the case of the non-systematic encoder, the throughput is proportional to $P$ and the latency to $n/P$. Table III explores the effect of different $n_{\max}$ and $P$ values on the pipelined Rn-flexible encoder. Throughput in excess of 10 Gbps is achievable by the encoder when $P > 32$. When $n < n_{\max}$, throughput remains unchanged and latency decreases to $n/n_{\max}$ of its original value. For example, when the encoder with $n_{\max} = 16384$ and $P = 64$ encodes a code with $n = 2048$, throughput remains 15 Gbps and latency decreases to 281 ns.
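
The 281 ns figure follows from the $2n/P + 2$ cycle latency of the two-pass architecture. The short calculation below is a sketch using a hypothetical helper name and the 235 MHz clock reported in Table III for $n_{\max} = 16384$ and $P = 64$.

```python
def systematic_latency_ns(n, p, f_mhz):
    """Latency of the two-pass systematic encoder: (2n/P + 2) cycles at f_mhz."""
    cycles = 2 * n // p + 2
    return cycles, cycles / f_mhz * 1000.0  # microseconds -> nanoseconds


cycles_short, latency_short = systematic_latency_ns(2048, 64, 235)
cycles_full, latency_full = systematic_latency_ns(16384, 64, 235)
```

For $n = 2048$ this gives 66 cycles, or about 281 ns, compared with 514 cycles, or about 2.2 μs, for $n = n_{\max}$.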

F. On Code Shortening 

As discussed in Subsection IV-A, the works in [?] and [?] describe shortening schemes for polar codes in which the last $n - n_s$ information bits in a polar code of length $n$ are replaced with ‘0’s. Those bits are discarded from the systematic codeword before transmission. The result is that $n_s$ bits containing $k_s$ information bits are transmitted, where $k_s = k - (n - n_s)$. These schemes are suitable for use with the proposed systematic encoder, yielding a system that can encode normal and shortened polar codes of any length $n \in [2, n_{\max}]$ without any other constraints on the code length or rate.

TABLE IV

Implementation of the proposed Rn-flexible systematic encoder with shortening on the Altera Stratix IV EP4SGX530KH40C2.

n_max   P    LUTs   FF     RAM (bits)  f (MHz)  Lat. (μs)  T/P (Gbps)
16,384  128  4,518  8,667  16,384      272      0.9        34.8

To adapt our proposed systematic encoder and enable shortening, the second input, $\mathrm{en}_i$, to the AND gates $\&_i$ becomes

$$\mathrm{en}_i = \begin{cases} 1 & \text{when } n > \lfloor (Pt + i)/2 \rfloor, \\ 0 & \text{otherwise.} \end{cases}$$

Adding code shortening ability has a minor effect on the resource utilization of the Rn-flexible encoder, as can be observed in Table IV.

VII. Flexible Software Encoders 

In this section, we present a software implementation of our systematic encoder using single-instruction multiple-data (SIMD) operations. We use both AVX (256-bit) and SSE (128-bit) SIMD extensions, in addition to the built-in types uint8_t, uint16_t, uint32_t, and uint64_t to operate on multiple bits simultaneously. The width of the selected type determines the encoder parallelism parameter $P$, e.g., $P = 8$ for uint8_t.

The component non-systematic encoder progresses from stage $S_1$ to $S_{\log n}$ and presents its output in natural order. The input to $S_1$ is a packed vector where bits corresponding to frozen locations are set to ‘0’ and information bits are stored in the other locations. The bit with index $t$ at the output of a stage $S_i$ is calculated according to:

$$\beta_i[t] = \begin{cases} \beta_{i-1}[t] \oplus \beta_{i-1}[t + 2^{i-1}] & \text{when } \lfloor t/2^{i-1} \rfloor \text{ is even}, \\ \beta_{i-1}[t] & \text{otherwise.} \end{cases} \qquad (29)$$

This operation is applied directly to $P$ bits simultaneously in stage $S_i$ if $2^{i-1} \ge P$. However, since we can only read and write data in groups of $P$ bits whose addresses are aligned to $P$-bit boundaries, operations in stages $S_i$ with $2^{i-1} < P$ are performed using a mask-shift-XOR procedure. A $P$-bit mask $m_i$ is generated for each stage in $[S_1, S_{\log P}]$ so that:

$$m_i[t] = \begin{cases} 0 & \text{when } \lfloor t/2^{i-1} \rfloor \text{ is even}, \\ 1 & \text{otherwise.} \end{cases}$$

The output for these stages is calculated using:

$$\beta_i[t : t+P-1] = \beta_{i-1}[t : t+P-1] \oplus \big( (\beta_{i-1}[t : t+P-1] \,\&\, m_i) \gg 2^{i-1} \big).$$

TABLE V

Latency and coded throughput of a software systematic encoder with $n = 32768$ and different $P$ values running on an Intel Core i7-2600.

P              8     16    32    64   128  256
Latency (μs)   64.1  30.1  14.3  7.7  4.1  3.3
T/P (Gbps)     0.5   1.1   2.3   4.2  8.0  10.0

The index $t$ starts at 0 and is incremented by $P$ with a final value of $n - P$. The group of $P$ bits with indices in $[t, t+P-1]$ is denoted $t : t+P-1$. The symbol & is the bit-wise binary AND operation, and >> is the logical right bit-shift operator.
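
The mask-shift-XOR procedure can be sketched with Python integers standing in for $P$-bit machine words. The packing convention (bit $t$ of the codeword stored at bit position $t \bmod P$ of word $\lfloor t/P \rfloor$) and all function names are assumptions made for this illustration; they are not taken from the implementation described in this work.

```python
def stage_masks(p):
    """Masks for stages with distance d < p: bit t is 1 when floor(t / d) is odd."""
    masks, d = [], 1
    while d < p:
        masks.append(sum(1 << t for t in range(p) if (t // d) % 2 == 1))
        d *= 2
    return masks


def encode_packed(words, p):
    """Non-systematic encoding on a list of p-bit words (len(words) * p = n)."""
    words = list(words)
    n = len(words) * p
    masks = stage_masks(p)
    d, stage = 1, 0
    while d < n:
        if d < p:
            m = masks[stage]  # mask-shift-XOR inside each word
            words = [w ^ ((w & m) >> d) for w in words]
        else:
            step = d // p     # whole words are d bits apart
            for w in range(len(words)):
                if (w // step) % 2 == 0:
                    words[w] ^= words[w + step]
        d *= 2
        stage += 1
    return words


def encode_bits(u):
    """Bit-by-bit reference following the stage equation directly."""
    x, d = list(u), 1
    while d < len(x):
        for t in range(len(x)):
            if (t // d) % 2 == 0:
                x[t] ^= x[t + d]
        d *= 2
    return x


def pack(bits, p):
    return [sum(b << k for k, b in enumerate(bits[i:i + p]))
            for i in range(0, len(bits), p)]


def unpack(words, p):
    return [(w >> k) & 1 for w in words for k in range(p)]


bits = [(t * 7 + 3) % 5 % 2 for t in range(64)]  # arbitrary test pattern
packed_result = unpack(encode_packed(pack(bits, 8), 8), 8)
```

For any power-of-two $P$ dividing $n$, the packed result matches the bit-by-bit reference, while touching only whole words.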

Since SSE operations lack bit shift operations, but include byte shifts, operations for stages $S_1$, $S_2$, and $S_3$ are performed using the uint64_t native type in the proposed software encoder. AVX version 1 does not provide any shift operations, and version 2 can only perform byte shifts within 128-bit lanes. Therefore, we use SSE instructions until stage $S_9$, where the encoder switches to using AVX operations. The masking operation between the two non-systematic encoding passes is applied using $P$-bit operations and masks.

The resulting software systematic encoder operates on data in-place and requires $n$ bits of additional memory to store the frozen-bit mask, and another $P \log P$ bits to store the stage masks. The latency and coded throughput values for the proposed software systematic encoder running on a 3.4 GHz Intel Core i7-2600 are shown in Table V for $n = 32768$. $P$ was varied between 8 and 256. It can be seen that the latency decreases linearly with increasing $P$ until $P = 128$. The latency only decreases by 20% between the SSE (128-bit) and AVX (256-bit) encoders for two reasons: the use of SSE for stages up to $S_9$ in the AVX encoder, and the overhead of loops and conditionals in the encoder. As a result, an encoder specialized for a given $n$ value is expected to be faster.

The speed results indicate that even an embedded 8-bit 
micro-processor running 1000 times slower would still be 
capable of transmissions at 500 kbps, eliminating the need 
for a dedicated hardware encoder for many applications such 
as remote sensors and some internet of things devices. 

VIII. Flexible Hardware Decoders

We complete the flexible hardware polar coding system in this section by presenting a flexible, systematic hardware decoder, which can decode channel messages based on the codewords generated by the encoder presented in Section VI. As discussed in [?], it is important that the parity bits be in bit-reversed locations to reduce routing complexity and simplify memory accesses.

The original Fast-SSC decoder was capable of decoding all polar codes of a given length: it resembled a processor where the polar code is loaded as a set of instructions [5]. In this section, we review the Fast-SSC algorithm, describe the architectural modifications necessary to decode any polar code up to a maximum length $n_{\max}$, and analyze the resulting implementation.
Fig. 4. The graph of an (8, 4) polar code and its corresponding Fast-SSC tree representation.


A. The Fast-SSC Decoding Algorithm 

Our proposed flexible decoders utilize the Fast-SSC algorithm presented in [5]. The polar code is viewed as a tree that corresponds to the recursive nature of polar code construction: a polar code of length $n$ is the concatenation of two polar codes of length $n/2$. In successive-cancellation decoding, the tree is traversed depth first starting from stage $S_{\log n}$ until a leaf node in stage $S_0$, corresponding to a constituent code of length $n = 1$, is reached. At that point, the output is ‘0’ if the leaf node corresponds to a frozen bit. Otherwise, it is calculated from the input log-likelihood ratio (LLR) based on threshold detection. The SSC decoding algorithm [?] directly decodes constituent codes of any length that are of rate 0 or rate 1 without traversing their sub-trees. The Fast-SSC algorithm directly decodes single parity check (SPC) and repetition codes, in addition to rate-0 and rate-1 codes, of any length. Fig. 4 shows the Fast-SSC tree of an (8, 4) polar code, where the flow of messages is indicated by the arrows.
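
The pruning of the decoder tree into directly decodable nodes can be sketched as below. The classification rules are stated as assumptions of this sketch: a repetition node has only its last bit unfrozen and an SPC node has only its first bit frozen. The frozen pattern used for the (8, 4) example is likewise an assumed one, chosen to be consistent with the node types named in Fig. 4.

```python
def classify(frozen):
    """Label a constituent code from its frozen-bit pattern (True = frozen)."""
    n = len(frozen)
    if all(frozen):
        return "RATE0"
    if not any(frozen):
        return "RATE1"
    if n > 1 and all(frozen[:-1]) and not frozen[-1]:
        return "REP"   # only the last bit carries information
    if n > 1 and frozen[0] and not any(frozen[1:]):
        return "SPC"   # only the first bit is frozen
    return None


def fast_ssc_leaves(frozen):
    """Depth-first split into halves until a directly decodable node is found."""
    kind = classify(frozen)
    if kind is not None:
        return [(kind, len(frozen))]
    half = len(frozen) // 2
    return fast_ssc_leaves(frozen[:half]) + fast_ssc_leaves(frozen[half:])


# Assumed frozen pattern for an (8, 4) code: indices 0, 1, 2 and 4 frozen.
frozen_84 = [True, True, True, False, True, False, False, False]
leaves = fast_ssc_leaves(frozen_84)
```

For this pattern the tree collapses to a length-4 repetition node followed by a length-4 SPC node, so no length-1 leaf is ever visited.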

B. Stage Indices and Sizes 

The Fast-SSC decoder in [5] starts decoding a polar code of length $n_{\max}$ at stage $S_{\log n_{\max}}$, where a stage $S_i$ corresponds to a constituent polar code of length $2^i$, as discussed in Section VIII-A. Since the decoder uses a semi-parallel architecture, the length of the constituent code is used to determine the number of memory words associated with a stage. The simplest method for that decoder to decode a code of length $n \le n_{\max}$ is to store the channel LLRs in the memory associated with stage $S_{\log n}$ and start the decoding process from there. This, however, requires significant routing resources since the architecture presented in [5] separates the channel and internal LLRs into different memories for performance reasons.

In the proposed flexible decoder, we calculate the length, $n_v$, of the constituent code associated with a stage $S_i$ as a function of $i$, $n$, and $n_{\max}$:

$$n_v(S_i) = 2^i \, \frac{n}{n_{\max}}. \qquad (30)$$

The memory allocated for a stage $S_i$ remains the same regardless of $n$ and is always calculated assuming $n = n_{\max}$. The flexible decoder always starts from $S_{\log n_{\max}}$, corresponding to a polar code of length $n \le n_{\max}$, performing operations on $n_v(S_i)/(2P)$ LLR values at a stage $S_i$, and proceeds until it encounters a constituent code whose output can be directly estimated according to the rules of the Fast-SSC algorithm.

TABLE VI

Implementation of the Rn-flexible polar decoder compared to the R-flexible decoder of [5] on the Altera Stratix IV EP4SGX530KH40C2.

Decoder        n       P    LUTs    FF     RAM (bits)  f (MHz)
[5]            2048    64   6315    1608   50,072      102
Proposed       2048    64   6507    1600   50,072      102
w/ Shortening  2048    64   6451    1613   52,120      102
[5]            32,768  256  24,066  7,231  536,136     102
Proposed       32,768  256  23,583  7,207  536,136     102
w/ Shortening  32,768  256  23,593  7,219  568,904     102
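
As a small worked example of (30), the sketch below evaluates the constituent-code length for a decoder built for $n_{\max} = 32768$ that is asked to decode a code of length $n = 2048$. The function name is hypothetical, and both lengths are assumed to be powers of two.

```python
def constituent_length(i, n, n_max):
    """n_v(S_i) = 2**i * n / n_max, for powers of two n <= n_max."""
    return (1 << i) * n // n_max


n_max, n = 32768, 2048
top = constituent_length(15, n, n_max)   # stage S_15 = S_log(n_max)
leaf = constituent_length(4, n, n_max)   # stage at which the length reaches 1
```

The top stage, $S_{15}$, is treated as a code of length 2048 rather than 32768, and the length-1 leaves are reached at $S_4$ instead of $S_0$; the memory layout itself is unchanged.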

C. Implementation Results 

Since memory is accessed as words containing $2P$ LLR or bit-estimate values, the limits used to determine the number of memory words per stage must be changed to accommodate the new $n$ value. The rest of the decoder implementation remains unchanged from [5]. These limits are now calculated according to (30), using the $n$ value provided to the decoder as an input.

Table VI compares the proposed flexible decoder ($n_{\max} = 32768$) with the Fast-SSC decoder of [5] ($n = 32768$) when both are implemented using the Altera Stratix IV EP4SGX530KH40C2 FPGA. The resource requirements are also provided for $n_{\max} = n = 2048$. It can be observed that the change in resource utilization is negligible as a result of the localized change in limit calculations.

When decoding a code of length $n \le n_{\max}$, the flexible decoder has the same latency (in clock cycles) as the Fast-SSC decoder for a code of length $n$. Since our Rn-flexible decoder implementation has the same operating clock frequency as the R-flexible decoder, it also has the same throughput and latency (in time). We note that the decoders presented in this work contain an additional input buffer to store an incoming channel vector while one is being decoded. This is done to enable loading-while-decoding and allows the decoder to sustain its throughput.

The implementation results for the Rn-flexible decoder supporting code shortening are also included in Table VI. The main change is the requirement for $n_{\max}$ more bits of RAM. This is a consequence of shortening being implemented using masking, where the LLRs corresponding to shortened bits are replaced with the maximum LLR value based on an $n_{\max}$-bit mask that is stored in said memory.

IX. Flexible Software Decoders

High-throughput software decoders require vectorization using SIMD instructions in addition to a reduction in the number of branches. However, these two considerations significantly limit the flexibility of the decoder, to the point where the lowest-latency decoders in the literature are compiled for a single polar code [11]. In this section, we present a software Fast-SSC decoder balancing flexibility and decoding latency. The proposed decoder has 37% higher latency than a fully specialized decoder, but can decode any polar code of length $n \le n_{\max}$. As will be discussed later in this section, there are two additional advantages to the proposed flexible software decoder: the resulting executable size is an order of magnitude smaller, and it can be used to decode very long polar codes for which an unrolled decoder cannot be compiled. Since SIMD instructions operate mostly ‘vertically’ on data stored in different vectors, natural indexing is preferable to reversed indexing, in contrast to the hardware decoders.

A. Memory 

Unlike in hardware decoders, it is simple to access an arbitrary memory location in software decoders. The LLR memory in the proposed software decoder is arranged into stages according to constituent code sizes. When a code of length $n \le n_{\max}$ is to be decoded, the channel LLRs are loaded into stage $S_{\log n}$, bypassing any stages with a larger index.

When backtracking through the code tree towards stages with high indices, the decoder performs the same operations on bit-estimates as the encoder, namely binary addition and copying, depending on the index of the output bit in question. Storing the bit-estimates in a one-dimensional array of length $n_{\max}$ bits enables the decoder to only perform the binary addition and store its results, eliminating superfluous copy operations and decreasing latency [?]. For a code of length $n \le n_{\max}$, the decoder writes its estimates starting from bit index 0. Once decoding is completed, the estimated codeword will occupy the first $n$ bits of the bit-estimate memory, which are provided as the decoder output.

B. Vectorization 

The unrolled software decoder [11] specifies input sizes for each command at compile time. This enables SIMD vectorization without any loops, but limits the decoder to a specific polar code. To efficiently utilize SIMD instructions while minimizing the number of loops and conditionals, we employ dynamic dispatch in the proposed decoder. Each decoder operation is implemented, using SIMD instructions and C++ templates, for all stage sizes up to $n_{\max}$. These differently sized implementations are stored in an array indexed by the logarithm of the stage size. Therefore, two branch operations are used: the first to look up the decoding operation, and the second to look up the correct size of that operation. This is significantly more efficient than using loops over the SIMD word size.
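
The two-level lookup can be illustrated with a table of size-specialized callables. The sketch below is only a Python analogue of the idea: closures stand in for C++ template instantiations, no SIMD is involved, and the operation names, the min-sum update rules, and the table layout are assumptions made for the illustration.

```python
LOG_N_MAX = 5  # hypothetical maximum stage index for this sketch


def make_f(size):
    """Min-sum F operation specialised for one output size."""
    def f_op(llr, bits, out):
        for k in range(size):
            a, b = llr[k], llr[k + size]
            sign = -1.0 if (a < 0) != (b < 0) else 1.0
            out[k] = sign * min(abs(a), abs(b))
    return f_op


def make_g(size):
    """G operation specialised for one output size."""
    def g_op(llr, bits, out):
        for k in range(size):
            a, b = llr[k], llr[k + size]
            out[k] = b - a if bits[k] else b + a
    return g_op


# First lookup selects the operation, second lookup selects its size.
OPS = {
    "F": [make_f(1 << s) for s in range(LOG_N_MAX)],
    "G": [make_g(1 << s) for s in range(LOG_N_MAX)],
}


def dispatch(op, log_size, llr, bits, out):
    OPS[op][log_size](llr, bits, out)


llr = [1.0, -2.0, 3.0, 0.5]
f_out, g_out = [0.0, 0.0], [0.0, 0.0]
dispatch("F", 1, llr, None, f_out)
dispatch("G", 1, llr, [1, 0], g_out)
```

Each call performs exactly two table lookups before running a routine whose size is fixed at construction time, mirroring the two branch operations described above.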

C. Results 

We compare the latency of the proposed vectorized flexible decoder with a non-vectorized version and with the fully unrolled decoder of [11] using floating-point values.

TABLE VII

Speed of the proposed vectorized decoder compared with that of non-vectorized and fully-unrolled decoders when $n = n_{\max} = 32768$ and $k = 29492$.

Decoder                 Latency (μs)  Info. Throughput (Mbps)
Scalar Fast-SSC         256           115
Unrolled Fast-SSC [11]  109           270
Proposed Fast-SSC       149           198

TABLE VIII

Speed of the proposed vectorized decoder compared with that of non-vectorized and fully-unrolled decoders for a (2048, 1723) code and $n_{\max} = 32768$.

Decoder                 Latency (μs)  Info. Throughput (Mbps)
Scalar Fast-SSC         16.2          106
Unrolled Fast-SSC [11]  4.0           430
Proposed Fast-SSC       11.1          155

Table VII compares the proposed flexible, vectorized decoder with a flexible, non-explicitly-vectorized decoder (denoted ‘scalar’) and a fully unrolled (denoted ‘unrolled’) one running on an Intel Core i7-2600 with AVX extensions. All decoders were decoding a (32768, 29492) polar code using the Fast-SSC algorithm, floating-point values, and the min-sum approximation. The flexible decoders had $n_{\max} = 32768$.
From the results in the table, it can be seen that the vectorized decoder has 58.2% the latency (or 1.7 times the throughput) of the scalar version. Compared to the code-specific unrolled decoder, the proposed decoder has 137% the latency (or 73% the throughput). In addition to the two layers of indirection in
the proposed decoder, the lack of inlining contributes to this 
increase in latency. In the unrolled decoder, the entire decoding 
flow is known at compile time, allowing the compiler to inline 
function calls, especially those related to smaller stages. This 
information is not available to the flexible decoder. 

Results for $n < n_{\max}$ are shown in Table VIII, where $n_{\max} = 32768$ for the flexible decoders and the code used was a (2048, 1723) polar code. The vectorized decoder has 68% the latency of the non-vectorized decoder. The gap between the proposed decoder and the unrolled one increases to 2.8 times the latency. This decrease in relative performance of the proposed decoder is a result of using a shorter code where a smaller proportion of stage operations are inlined.

In addition to decoding different codes, the proposed flexible decoder has an advantage over the fully unrolled one in terms of resulting executable size and the maximum length of the polar code to be decoded. The size of the executable corresponding to the proposed decoder with $n_{\max} = 32768$ was 0.44 MB with 3 KB to store the polar code instructions in an uncompressed textual representation, whereas that of the unrolled decoder was 3 MB. In terms of polar code length, the GNU C++ compiler was unable to compile an unrolled
decoder for a code of length 2^^ even with 32 GB of RAM; 
while the proposed decoder did not exhibit any such issues. 

X. Conclusion 

In this work, we studied the flexibility in code rate and length of polar encoders and decoders. We proved the correctness of a flexible, parallelizable systematic polar encoding algorithm and used it to implement high-speed, low-complexity hardware encoders with throughput up to 29 Gbps on FPGA. The proof of correctness was provided not only for polar, but also for Reed-Muller codes. Software encoders were also presented and shown to achieve throughput up to 10 Gbps. We demonstrated rate- and length-flexible hardware decoders that had similar implementation complexity and the same speed as their rate-only flexible counterparts. Finally, we introduced software decoders that are flexible and able to achieve 73% the throughput of their unrolled, code-specific counterparts.